🌐💬 3D-GRAND

¹University of Michigan  ²New York University
*Denotes Equal Contribution

Abstract

3D-grounded conversation generation helps alleviate hallucination in multimodal LLMs. Grounding also makes the responses of 3D large language models more actionable and interpretable in a physical 3D environment for embodied and robotics tasks. Recent efforts have constructed large grounded-conversation datasets for 2D images; however, the 3D research community currently lacks a large-scale dataset of this kind. In this paper, we introduce the first million-scale 3D grounded conversation dataset, consisting of 3.2M 3D-text pairs across 4.2k 3D scenes.
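To make "grounded conversation" concrete, the sketch below shows one plausible way such a 3D-text pair could be represented, where each referring phrase in a response is tied to a specific object instance in the scene. The field names (`scene_id`, `grounded_phrases`, `object_id`) are illustrative assumptions, not the dataset's actual schema.

```python
# A hypothetical grounded-conversation record. Field names and structure
# are illustrative assumptions, not the official 3D-GRAND schema.
sample = {
    "scene_id": "scene0000_00",  # assumed: identifier of a 3D scan
    "question": "Where can I sit and read near the window?",
    "answer": "You could sit on the [armchair] next to the [window].",
    "grounded_phrases": [
        # each bracketed phrase in the answer is tied to a 3D object instance
        {"phrase": "armchair", "object_id": 12},
        {"phrase": "window", "object_id": 4},
    ],
}

# Grounding lets a downstream agent resolve language to scene geometry:
for g in sample["grounded_phrases"]:
    print(f"'{g['phrase']}' -> object instance {g['object_id']}")
```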

Demo

(Interactive demo: choose a task and a scene dataset.)

3D-POPE Leaderboard

All metrics are in percent; Accuracy (Acc), Precision (Prec), and Recall (Rec) are reported per split.

| Model Idx | 🤖 Model | Acc (Random) | Prec (Random) | Rec (Random) | Acc (Popular) | Prec (Popular) | Rec (Popular) | Acc (Adversarial) | Prec (Adversarial) | Rec (Adversarial) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | — | 50.00 | 50.00 | 100.00 | 50.00 | 50.00 | 100.00 | 50.00 | 50.00 | 100.00 |
| 2 | — | 50.12 | 50.08 | 77.13 | 50.27 | 50.23 | 77.13 | 50.44 | 50.48 | 77.14 |
| 3 | LEO | 54.03 | 52.70 | 78.52 | 48.86 | 49.28 | 77.44 | 49.77 | 49.85 | 77.67 |
| 4 | — | 86.45 | 87.26 | 85.36 | 80.85 | 78.30 | 85.35 | 81.47 | 78.98 | 85.78 |
| 5 | — | 85.68 | 88.22 | 82.34 | 81.69 | 81.32 | 82.28 | 82.10 | 81.72 | 82.72 |
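For context, POPE-style benchmarks probe object hallucination with yes/no existence questions ("Is there a <object> in this room?"); the Random, Popular, and Adversarial splits differ in how absent (negative) objects are sampled. Treating "yes" as the positive class also explains row 1: a model that always answers "yes" on a balanced split scores exactly 50% accuracy, 50% precision, and 100% recall. Below is a minimal sketch of the metric computation, assuming 3D-POPE follows the 2D POPE protocol with balanced yes/no ground truth; the function name `pope_metrics` is illustrative.

```python
# Minimal sketch of POPE-style metrics, assuming "yes" is the positive
# class and the ground truth is balanced (half yes, half no), as in 2D POPE.

def pope_metrics(predictions, labels):
    """predictions, labels: lists of 'yes' / 'no' strings."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 100 * accuracy, 100 * precision, 100 * recall

# An always-"yes" model on a balanced split reproduces row 1 of the table:
labels = ["yes"] * 500 + ["no"] * 500
preds = ["yes"] * 1000
print(pope_metrics(preds, labels))  # (50.0, 50.0, 100.0)
```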

Citation

@misc{yang2023llmgrounder,
      title={LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent}, 
      author={Jianing Yang and Xuweiyi Chen and Nikhil Madaan and Madhavan Iyengar and Shengyi Qian and David F. Fouhey and Joyce Chai},
      year={2023},
      eprint={2309.12311},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}